H5 dataloader + improved PVE computation #1

Open
cgohil8 wants to merge 5 commits into main from h5-dataloader

cgohil8 commented Apr 18, 2026

Changes:

  • Introduce a generic h5-backed Dataset (H5Session/H5Dataset/build_h5_dataset) as a drop-in replacement for pnpl.datasets.CamcanGlasser, so any session-indexed CSV + per-session .h5 files can feed the tokenizer pipeline.
  • Rename CamcanGlasserDataModule → EphysDataModule (and _collate_camcan → _collate_default) to reflect that the data module is dataset-agnostic; it only requires a ConcatDataset of per-session sub-datasets exposing .subject.
  • H5Session does per-session z-score standardisation, drops trailing samples that don't fill a window, and re-opens h5 handles per worker PID so forked DataLoader workers don't share a stale handle.
  • Rework get_pve to stream per-session sum-of-squared-error / sum-of-squared-total through the loader instead of preallocating (n_total_sequences, L, C) tensors for originals and reconstructions - unblocks PVE on datasets that don't fit in memory.
  • Increased model-architecture figure width in the README from 25% to 40%.
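
The H5Session behaviour described in the list above (per-session z-scoring, dropping the trailing partial window, re-opening the handle per worker PID) could look roughly like the sketch below. This is not the PR's code: the class name `SessionWindows`, the `opener` callable, the `key` argument, and the choice of non-overlapping windows are all assumptions made for illustration. With h5py, `opener` might be `lambda: h5py.File(path, "r")`.

```python
import os

import numpy as np


class SessionWindows:
    """Hypothetical stand-in for H5Session; names and layout are assumptions."""

    def __init__(self, opener, key, window_len):
        self._opener = opener        # callable returning a mapping-like file handle
        self._key = key              # name of the (time, channels) array in the file
        self._window_len = window_len
        self._handle = None
        self._pid = None

        # Read the session once to get per-channel statistics for z-scoring.
        data = np.asarray(self._file()[key], dtype=np.float64)
        self._mean = data.mean(axis=0)
        self._std = data.std(axis=0)
        self._std[self._std == 0] = 1.0  # guard against flat channels

        # Integer division drops trailing samples that do not fill a window.
        self._n_windows = data.shape[0] // window_len

    def _file(self):
        # Re-open whenever the PID changes, so a forked DataLoader worker
        # never reuses the handle that was opened in the parent process.
        pid = os.getpid()
        if self._handle is None or self._pid != pid:
            self._handle = self._opener()
            self._pid = pid
        return self._handle

    def __len__(self):
        return self._n_windows

    def __getitem__(self, i):
        if not 0 <= i < self._n_windows:
            raise IndexError(i)
        start = i * self._window_len
        window = np.asarray(self._file()[self._key][start:start + self._window_len])
        return (window - self._mean) / self._std
```

Taking an `opener` rather than a path keeps the sketch testable without real files; whether the PR's H5Session is structured this way is not stated.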

cgohil8 added 3 commits April 17, 2026 20:17
…odule

H5Session/H5Dataset/build_h5_dataset load windowed continuous signals from
per-session h5 files and expose the .dataset (ConcatDataset) + .subject
interface that the DataModule already consumed, so they plug in as a drop-in
replacement for pnpl.datasets.CamcanGlasser. The module (and its collate fn)
were always dataset-agnostic; renamed to reflect that.
Replace mean/std with median + 1.4826·MAD. For Gaussian data the two are
asymptotically identical (1.4826 is the MAD→σ consistency factor), but MAD is
unaffected by localised high-amplitude artefacts whose inflated σ otherwise
scales the rest of the session down and destroys reconstruction quality on
long or noisy recordings.
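
The reasoning in this commit message can be checked on synthetic data. The sketch below is an illustration only: `robust_scale` is a hypothetical helper, and the artefact is simulated by inflating 1% of Gaussian samples by a factor of 50.

```python
import numpy as np


def robust_scale(x):
    """Return (median, 1.4826 * MAD) per channel for x of shape (time, channels)."""
    med = np.median(x, axis=0)
    mad = np.median(np.abs(x - med), axis=0)
    return med, 1.4826 * mad  # 1.4826 makes MAD consistent with sigma for Gaussian data


rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(100_000, 1))

noisy = clean.copy()
noisy[:1000] *= 50.0  # simulated localised high-amplitude artefact (1% of samples)

_, robust_clean = robust_scale(clean)
_, robust_noisy = robust_scale(noisy)

print("clean: std =", clean.std(axis=0), " robust =", robust_clean)
print("noisy: std =", noisy.std(axis=0), " robust =", robust_noisy)
```

On the clean signal both estimates should sit close to 1. With the artefact, the standard deviation grows several-fold while the MAD-based estimate barely moves, which is the effect the message describes.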
…l dataset

The old implementation preallocated two (n_total_sequences, L, C) float32
tensors on the compute device, which is fine for a ~50-subject CamCAN subset
but OOMs on larger datasets — e.g. ~24 GB combined for 277k windows at L=200,
C=54. Replace with a streaming pass that accumulates per-session sums of
squared error and squared total on the host (O(n_sessions) memory).
np.searchsorted maps each batch element to its session in O(log n).
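
The memory figure is consistent with 277,000 × 200 × 54 × 4 bytes ≈ 12 GB per float32 tensor, so about 24 GB for the pair. A minimal sketch of the streaming alternative follows. It is not the PR's get_pve: the function name `streaming_pve`, the batch format, and the `session_ends` argument are hypothetical, and it assumes PVE = 100 · (1 − SSE / SST) with SST taken as the sum of squares of the z-scored signal (the source does not say whether the mean is subtracted).

```python
import numpy as np


def streaming_pve(batches, session_ends):
    """Per-session PVE accumulated batch by batch.

    batches      : iterable of (global_indices, originals, reconstructions),
                   with arrays of shape (B,), (B, L, C), (B, L, C).
    session_ends : sorted exclusive end index of each session in the
                   concatenated dataset, e.g. [3, 5] for sessions of size 3 and 2.
    """
    session_ends = np.asarray(session_ends)
    sse = np.zeros(len(session_ends))  # sum of squared error per session
    sst = np.zeros(len(session_ends))  # sum of squared total per session

    for idx, x, x_hat in batches:
        # side="right" sends an index equal to session_ends[k] to session k + 1.
        sess = np.searchsorted(session_ends, idx, side="right")
        err = ((x - x_hat) ** 2).reshape(len(idx), -1).sum(axis=1)
        tot = (x ** 2).reshape(len(idx), -1).sum(axis=1)
        # np.add.at accumulates correctly when a session repeats within a batch.
        np.add.at(sse, sess, err)
        np.add.at(sst, sess, tot)

    return 100.0 * (1.0 - sse / sst)
```

Only two length-n_sessions arrays persist across batches, so peak memory is set by the batch size rather than by the dataset size.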
cgohil8 added 2 commits April 18, 2026 09:48
Outlier sessions are now excluded upstream via the curated subset, so
robust MAD estimation is no longer needed.
cgohil8 changed the title from "H5 dataloader" to "H5 dataloader + improved PVE computation" Apr 19, 2026
cgohil8 marked this pull request as ready for review April 19, 2026 20:54
cgohil8 requested a review from scho97 April 19, 2026 20:54